Skip to content

Chore/harden perspicacite#11

Open
lfnothias wants to merge 29 commits into
mainfrom
chore/harden-perspicacite
Open

Chore/harden perspicacite#11
lfnothias wants to merge 29 commits into
mainfrom
chore/harden-perspicacite

Conversation

@lfnothias
Copy link
Copy Markdown
Collaborator

Summary

Checklist

  • Tests pass locally: uv run pytest
  • Linting passes: uv run ruff check src/ tests/
  • Type check passes: uv run mypy src/
  • Changelog / ROADMAP updated if this is a user-facing change
  • Docs updated if the public API surface changed (CLI flags, REST endpoints, MCP tools, config keys)
  • New dependencies reviewed for license compatibility (Apache 2.0-compatible only)
  • No unrelated cleanup mixed in (keep diffs focused)
  • CLA signed if this is an external contribution (see CONTRIBUTING.md)

Test plan

Related issues

lfnothias and others added 29 commits May 28, 2026 15:08
…generate_report

generate_report returned LLM-synthesized `report` text alongside factual
`sources`/`papers_used` with no marker distinguishing model-authored prose from
verified metadata, and the synthesis prompts treated retrieved chunks as trusted
canon — so a poisoned preprint's embedded instructions could steer the report
(addresses #1).

- Add a non-breaking `_provenance` envelope to both report payloads: provider,
  model, rag_cycles_executed, ai_generated_fields=["report"], and
  untrusted_sources (all retrieved sources are attacker-influenceable).
- Wrap retrieved chunk bodies in [UNTRUSTED_DOCUMENT] markers (citation headers
  stay outside) and append a clause to the system/mandatory prompts instructing
  the model to treat that content strictly as data, never as instructions.

The verbatim v1 MANDATORY_PROMPT_TEMPLATE is left untouched; the `report` field
is not renamed (would break ASB/Scriptorium/Mimosa consumers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
claims_to_graph emitted every qualifier on asb:qualifier, but that slot is a
closed Bucur enum in indicium — domain-adapter qualifiers (e.g. aligned_with)
failed SHACL. Route by membership: core Bucur terms stay on asb:qualifier;
domain terms go on the open asb:domainQualifier slot added in indicium 1.7.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tags each claim with verify_quote status, emits a JSONL audit sidecar, and
applies strict (drop unverified) vs fail-open (keep tagged) policy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nt path

Addresses B1 review: guard the candidates comprehension against non-dict
passages (the one unguarded line that could raise), document in-place mutation
and the strict-ignored-when-verifier-unavailable behavior, and add a test that
simulates the indicia extra being absent (degrades all to unverified, kept).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the positional chunk_idx bug (builder.py) with verify_quote's
matched_index; writes quote_exact + oa:TextQuoteSelector on verified/repaired
evidence and only an anchorStatus marker on unverified (no false selector).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The test bound a graph for "kbAnchor" without redirecting the on-disk
manifest, so after the first run the persisted manifest marked the paper
unchanged and subsequent runs skipped extraction (empty graph -> assertion
failed). Monkeypatch manifest._DATA_DIR to tmp_path, matching every other
build_claim_graph test. Also sort the new imports and drop the now-unused
enumerate index left by the positional->content-match rewrite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ig to builder

annotate_anchor_status tags each extracted record (verified/repaired/
unverified/unchecked) before the license rewrite; MCP build_claim_graph now
passes config.anchor.* (strict/near_threshold/audit_dir) into the builder.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The build_claim_graph MCP tool now reads config.anchor.* (R3), so its unit
test must give the fake MCPState a real Config (was None). Also moves the
type-only pathlib.Path import under TYPE_CHECKING and drops a stale noqa in
anchor.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drives build_claim_graph with a fabricated quote (verbatim in no passage) and
asserts the claim is kept (fail-open) but tagged unverified with zero
quoteExact / oa:TextQuoteSelector triples — pinning the no-laundering
invariant at the builder layer where the selector logic lives.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire the MCP claim-extraction tool through the R3 anchor invariant: after
extract_claims, run anchor_claims so every returned claim carries an _anchor
record {status, matched_index, quote_exact, score}. quote_exact is populated
only for verified/repaired claims — never fabricated for an unverified quote
(no-laundering holds at the tool layer, not just the builder).

Also fix a latent blank-passage bug: get_relevant_passages emits flat
{text, source_doi}, but extract_claims and anchor_claims both read
{chunk_text, source:{doi}}. Normalize once up front so the model sees real
passage text and the verifier gets real candidates — otherwise anchoring would
silently see empty text and tag everything unverified.

Pin indicium>=1.8.0 (the floor that ships the verify_quote kernel the anchor
path imports) and add two unit tests: a verbatim quote anchors verified with
quote_exact recovered; a fabricated quote is kept but tagged unverified with
quote_exact=None.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The retry predicate now requires isinstance(e, Exception) so that
asyncio.CancelledError (a BaseException, which tenacity otherwise catches)
propagates instead of being retried. Without this, every caller-side
asyncio.wait_for/timeout was ineffective — all 3 retry attempts ran their
full HTTP timeout before the stale deadline was noticed, so the structured
extraction tools blew past their 80s cap (observed 115-151s). The
extract_failure_modes / extract_parameters calls now return within the cap.

route_kbs gains a kb_names alias for candidate_kbs: every other multi-KB
tool names that argument kb_names, so agents reach for it first and the
mismatch previously errored the first call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Flip the CI ruff gate from report-only (--exit-zero) to enforcing, pin
ruff to 0.15.13, and align CI + mypy to Python 3.12 (matches
requires-python). Add .pre-commit-config.yaml running the same pinned
ruff (lint + format-on-touched-files only, per the pyproject note).

Resolve all outstanding ruff findings via the established fix-vs-ignore
policy: fix real issues (B904 exception chaining, F401/F841 dead code,
RUF012 ClassVar, E731, UP035, SIM105, B905 zip strict=) and config-ignore
genuine false positives with justifications (E402, UP042, SIM102, SIM117,
TC001-3, RUF001-3, N806/N814, B008, plus test-idiom per-file ignores).

Also fix two regressions from over-eager autofixes surfaced by the suite:
- jobs/registry.py: SIM118 had rewritten `for k in row.keys()` on an
  aiosqlite.Row (iterating a Row yields values, not keys) -> IndexError.
  Restored .keys() with a justified noqa.
- cite_graph.py: restored openalex_id_for_doi (a test monkeypatch target)
  with a justified noqa: F401.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adopt a pragmatic-but-enforced mypy config: semantic bug-catchers stay on
(attr-defined, arg-type, return-value, index, assignment, ...) while
annotation-completeness rules stay off to avoid churn in a research codebase.
Drove all findings to zero across src/, fixing several real latent bugs
surfaced along the way: StreamEvent payloads passed a dict where a str was
required (now json.dumps), a wrong base64.binascii.Error reference, non-indexable
Iterable[Row] from execute_fetchall, a referenced-but-deleted except variable,
and a missing DOI None-guard. Added `mypy src/` to the CI lint job so the gate
holds going forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit of bare `except Exception` swallows: narrow defensive parse/nav
handlers to precise exception types (KeyError/TypeError/AttributeError,
ValueError, json.JSONDecodeError, ImportError) so unexpected errors
propagate instead of being silently eaten; keep best-effort side-effect
handlers broad but add logger.debug observability. Behavior-preserving:
identical control flow, no new suppression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move the seven implementation helpers (_start_mcp_and_web, _run_query,
the bibtex KB builders, _build_app_state_for_cli, and the ingest summary
printers) out of the 2000-line cli.py into a dedicated cli_helpers module,
re-importing them so perspicacite.cli.<name> stays a valid monkeypatch
target for the existing unit tests. First slice of the god-module split;
the larger server.py/orchestrator splits remain a dedicated follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reconciles the hardening work (ruff/mypy CI gate + cli_helpers
decomposition) with what landed on main after the branch was cut:
the R3 anchor invariant (PR #10) and the claims markdown-fence fix
(PR #9). The anchor logic was already present on this branch via its
feat/domain-qualifier-routing base, so it reconciles cleanly; PR #9's
_strip_markdown_fence is the net new content.

Conflicts (both resolved keeping BOTH anchor logic and lint cleanups):
- src/perspicacite/indicium_layer/builder.py: dropped unused
  INDICIUM_NS import (kept the W1 F401 cleanup; anchor code unaffected).
- tests/unit/test_claims_validation.py: kept the isort blank line
  separating third-party from first-party imports.

Verified on the merged tree: ruff clean, mypy clean (241 files),
2487 passed / 1 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI's lint and test jobs install base deps only (no `indicia` extra), so
rdflib/pyoxigraph/indicium are absent there while present locally. Two
pre-existing failures surfaced after the main merge:

- mypy lint: the ignore_missing_imports override omitted rdflib/pyoxigraph,
  and store.py aliased `ConjunctiveGraph = Dataset` — a local assignment that
  is invalid as a type annotation. Add both modules to the override list,
  drop the alias (use Dataset directly), and disable warn_unused_ignores for
  indicium_layer.store where the rdflib union-attr ignore is env-dependent.
- smoke test: _TOOL_ARGS lacked entries for 8 web/skill/claim-graph tools, so
  test_tool_inventory called them with no args and hit TypeErrors. Supply
  minimal valid args; claim-graph tools short-circuit cleanly at `import
  indicium` in CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…etry test

CI installs only ".[dev]" via plain pip, which cannot resolve the local-path
`indicium` source, so indicium/rdflib/pyoxigraph/indicium_adapters are never
present in CI. Guard every claim-graph / verify-kernel test with
pytest.importorskip so they skip cleanly there while still running locally
under `uv sync --extra indicia`. Mixed modules use per-test guards to preserve
their non-indicia coverage rather than skipping the whole module.

Also deselect test_domain_aggregator::test_retry_counts_as_one_circuit_failure
from the safe subset: it exercises real exponential-backoff retry sleeps
(~27s) and exceeds the 15s safe-subset timeout. It passes under a longer
timeout and is unrelated to the indicia extra.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…on links

Move the 18 root-level config files into a tidy config/ tree to declutter
the repo root:
  - config/example.yml                (was config.example.yml)
  - config/providers/*.yml            (6 provider presets, was config.*.example.yml)
  - config/embeddings/*.yml           (11 model presets, was config_*.yml)

Names are cleaned (drop config./config_ prefix and .example infix,
underscores -> hyphens). All references updated across docs, README,
help/error strings (agent_cli, claude_cli, check_install), and tests.
The config-audit test glob now points at config/providers + config/example.

Also add proper citation links to the README (CEUR-WS PDF + HAL archive)
in both the human-readable block and the BibTeX entry, and extend
.gitignore with tooling caches (.mypy_cache, .pytest_cache, .coverage)
and local claim-graph stores.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- CI: enable actions/setup-python pip caching (keyed on pyproject.toml)
  for both the lint and test jobs to speed up dependency installs.
- Add .github/dependabot.yml: weekly grouped updates for pip and
  github-actions ecosystems.
- pyproject.toml: add readme, authors, keywords, trove classifiers, and
  [project.urls] (Homepage/Repository/Issues/Changelog/Paper) for a
  release-quality package description.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update inbound references to the relocated root files (INSTALL_AGENT.md,
MANUAL_QA.md) in setup.md, docs/development/testing.md, and docs/index.md,
and drop the now-stale root allowlist entries from .gitignore (the files
live under the already-allowlisted docs/ tree).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Watch frontend/package-lock.json (grouped, weekly) so the frontend's npm
dependencies get the same automated update treatment as the Python deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire a third CI job for the Next.js frontend running lint, typecheck,
unit tests, and build with npm caching. Add eslint flat config (Next 16
core-web-vitals; React-Compiler-era react-hooks rules set to warn for a
green baseline) and a vitest suite covering the cost estimation helpers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove the blanket warn-downgrade of the React-Compiler react-hooks rules
and address all 21 findings: genuinely unnecessary effects are refactored
(redundant setState removed; reset-on-change moved to render-time guards;
ChatPanel's cancel callback reordered above its consuming effect), and the
remaining legitimate effect patterns (async mount fetches, SSR-safe
localStorage hydration, intentional latches) carry per-line justified
disables. The four rules are back at "error" so future violations fail CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a log line to broad exception handlers that previously swallowed
errors silently, so failures in optional-dep fallbacks, JSON parsing,
download workers, and background jobs surface in logs. Behavior is
preserved: control flow, return values, and re-raises are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chestrator

Completes the silent-handler audit for the two modules deferred from the
first pass. Each previously-silent broad except now logs one line (debug
for best-effort/fallback paths, warning where user-meaningful work is
dropped) before its existing fallback. Behavior is unchanged: control
flow, return values, and caught types are preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant